Lab 1: Basic Data Wrangling and Plotting, Distributions



Submission Instructions

This lab will be submitted in pairs using GitHub (if you don’t have a pair, please contact us).
Please follow the steps in the GitHub-Classroom Lab 1 to create your group’s Lab 1 repository.
Important: your team’s name must be FamilyName1_Name1_and_FamilyName2_Name2.
You can collaborate with your partner using the git environment; You can either make commits straight to the main repository, or create individual branches (recommended). However, once done, be sure to merge your branches to master - you will be graded using the most recent master version - your last push and merge before the deadline of May 15th.
Please do not open/review other peoples’ repositories - we will be notified by GitHub if you do.

Your final push should include this Rmd file (with your answers filled-in), together with the html file that is outputted automatically by knitr when you knit the Rmd. Anything else will be disregarded. In addition, please adhere to the following file format:
Lab_1_FamilyName1_Name1_and_FamilyName2_Name2.Rmd/html



The only allowed libraries are the following (please do not add additional libraries without permission from the course staff):



Analysis of the World Covid-19 Dataset

The world-of-data website hosts world-wide epidemiological data on the Corona Virus (COVID-19). The dataset is compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE) from various sources, and follows The dataset contains data since January 2020. For the data and more information about it, please visit here.

You can see several nice visualizations of the data here

we will focus on analyzing the Covid-19 cases, deaths and vaccinations data over time for different countries and continents.

General Guidance

  • Your solution should be submitted as a full Rmd report integrating text, code, figures and tables. For each question, describe first in the text of your solution what you’re trying to do, then include the relevant code, then the results (e.g. figures/tables) and then a textual description of them.

  • In most questions the extraction/manipulation of relevant parts of the data-frame can be performed using commands from the tidyverse and dplyr R packages, such as head, arrange, aggregate, group-by, filter, select, summaries, mutate, order etc.

  • When displaying tables, show the relevant columns and rows with meaningful names, and describe the results.

  • When displaying figures, make sure that the figure is clear to the reader, axis ranges are appropriate, labels for the axis , title and different curves/bars are displayed clearly (font sizes are large enough), a legend is shown when needed etc. Explain and describe in text what is shown in the figure.

  • In many cases, data are missing (e.g. NA). Make sure that all your calculations (e.g. taking the maximum, average, correlation etc.) take this into account. Specifically, the calculations should ignore the missing values to allow us to compute the desired results for the rest of the values (for example, using the option na.rm = TRUE).

  • Grading: There are \(10\) questions overall. Each question is worth \(10\) points for the lab grade. The questions vary in length and difficulty level. It is recommended to start with the simpler and shorter questions.

Questions:

  1. Loading data and basic processing:
    1. Load the complete covid19 dataset in csv format from world-of-data into a data-frame in R.
      Change if needed the class of the date variable to Date and check that the class is correct.

explantion: we save the data set by “covid-19 name” and than changed class of date to “date”

covid_19 = read.csv("/Users/shaharzismanovich/Downloads/r52414_2021_22_lab1-zismanovich_shahar_and_honen_shahaf-main/owid-covid-data.csv")
covid_19$date = as.Date(covid_19$date)
print(class(covid_19$date))
## [1] "Date"
b. List in a table the top five *dates* in terms of number of `new_cases` for `High income` countries. Show only the date and the number of new cases  at this date. <br>

Repeat the same with two additional separate tables for top five dates for new_deaths and new_vaccinations.

explanation: we used pipe function to filter the relevent details that the question deamend and then we used also the top n func, that bring us the top relevent number of data set by defining one of the columns.

high_income = subset.data.frame(covid_19, covid_19$location == 'High income')

high_income %>% top_n(5, new_cases) %>% select(new_cases, date)
##   new_cases       date
## 1   2589456 2022-01-10
## 2   2737979 2022-01-18
## 3   2830635 2022-01-19
## 4   2559027 2022-01-21
## 5   2532198 2022-01-26
high_income %>% top_n(5, new_deaths) %>% select(new_deaths, date)
##   new_deaths       date
## 1       9967 2021-01-08
## 2      10170 2021-01-12
## 3      11047 2021-01-20
## 4      10279 2021-01-26
## 5      10079 2021-01-27
high_income %>% top_n(5 ,new_vaccinations) %>% select(new_vaccinations, date)
##   new_vaccinations       date
## 1          9453347 2021-06-09
## 2          9539660 2021-06-10
## 3          9485687 2021-06-11
## 4         11981277 2021-08-04
## 5          9751526 2021-12-17
  1. Comparing low vs. high income countries:
    1. Write a function that receives as input the data-frame, and a column name as string. The function plots the value of the input column as a function of the date for both High income and Low income countries, shown on the same graph with different colors or symbols. Use meaningful axis and plot labels, and add an informative legend. NA or other no-number values should not be displayed.

2a explanation - we vreated function that creating subset by 2 parameters- high or low income, and then making plot that definding bewtween them by the relevent column the user should insert. In the plot we used ggpolt func.

low_vs_high_comp<- function(d_frame, column){
    
  high_income_f = subset.data.frame(d_frame, d_frame$location == 'High income'| d_frame$location == 'Low income')

    
  ggplot(high_income_f, aes(x = date, y= get(column))) + 
  geom_point(aes( color= location)) + 
    xlab("date")+
    ylab(column)
  }
b. Use the function written in (a.) and plot of the number of `new_cases_per_million` vs. date for the high vs. low income countries.

2.b exp- In this task we used the func we were create to make a graghic equation between high and low income location by diffrent column. x- axis show us the date and y axis the number of each relevent column. to calculate the log deatils we created new column to the data set that oparate log function on the relevent column.

low_vs_high_comp(covid_19, "new_cases_per_million" )

covid_19$log_new_cases_smoothed_per_million = log(covid_19$new_cases_smoothed_per_million)

Next, make a similar plot for the log of the smoothed number of new cases per million, new_cases_smoothed_per_million.

low_vs_high_comp(covid_19, "log_new_cases_smoothed_per_million" )
## Warning: Removed 10 rows containing missing values (geom_point).

Which plot is easier to interpret? explain.

Answer- the easiest graph to analize is the log one- beacuse the behavior of the functions are same and its easy to see the diffrent in each date on the graph, They are look the same so we can analyse them same Similarly, make two additional separate plots for the log of the smoothed number of new deaths and new vaccinations per million as a function of date for the high vs. low income countries. Describe the plotted results.

describing the grahs.- as we can see, high income location died more(by log func) and did more vaccinations.

covid_19$log_new_deaths_per_mill = log(covid_19$new_deaths_per_million)
low_vs_high_comp(covid_19, "log_new_deaths_per_mill" )

covid_19$log_vac_per_mill = log(covid_19$new_vaccinations_smoothed_per_million)
low_vs_high_comp(covid_19, "log_vac_per_mill")
## Warning: Removed 680 rows containing missing values (geom_point).

  1. Storing and analyzing current (cumulative) values:
    1. Create a new data-frame called current with one row per country (and other locations), that for each country will store as columns the country name (location) and continent, and also the current values (latest date reported for each value) for: `,total_deaths_per_million,total_vaccinations_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundredandexcess_mortality_cumulative_per_million`.
current <-  subset(covid_19, select = c(location, date,total_vaccinations_per_hundred, people_fully_vaccinated_per_hundred, total_boosters_per_hundred, excess_mortality_cumulative_per_million, continent, total_deaths_per_million, total_cases_per_million  ))
#creating new DS
current <- fill(current,everything())
#fixing the N.A
current <- subset(current, date == max(date), c("location", "continent","total_deaths_per_million", "total_vaccinations_per_hundred", "people_fully_vaccinated_per_hundred", "total_boosters_per_hundred", "excess_mortality_cumulative_per_million", "total_cases_per_million" ))
#comulate the the details to one row for each country.

head(current)
##         location continent total_deaths_per_million
## 792  Afghanistan      Asia                  192.869
## 1595      Africa                            184.074
## 2386     Albania    Europe                 1216.874
## 3177     Algeria    Africa                  154.091
## 3962     Andorra    Europe                 1977.920
## 4729      Angola    Africa                   55.992
##      total_vaccinations_per_hundred people_fully_vaccinated_per_hundred
## 792                           14.93                               11.56
## 1595                          34.53                               16.16
## 2386                          97.51                               42.68
## 3177                          30.72                               13.70
## 3962                         196.89                               69.04
## 4729                          55.28                               18.06
##      total_boosters_per_hundred excess_mortality_cumulative_per_million
## 792                          NA                                      NA
## 1595                       1.48                                      NA
## 2386                       9.97                                5101.057
## 3177                       1.10                                1118.164
## 3962                      53.06                                1158.311
## 4729                       0.98                                1158.311
##      total_cases_per_million
## 792                 4487.086
## 1595                8477.007
## 2386               95661.091
## 3177                5956.770
## 3962              530198.826
## 4729                2925.919

for question 3c we added to current the column of total_cases_per_million.

b. Show the values for the current `total_deaths_per_million` in different countries in a histogram with 30 bins. Does this histogram look close to the normal distribution? 

Compute the skewness and kurtosis for this distribution, and explain what they mean about the empirical distribution of the data.

skewness(current$total_deaths_per_million, na.rm = T)
## [1] 1.306257
histo_deahth<- hist(current$total_deaths_per_million, breaks=30)

kurtosis(current$total_deaths_per_million, na.rm = T)
## [1] 1.642502

Skewness value are possitive so we have right distortion kurtosis showing us th level of outliers and here is 1.642 .

dosnt look like normal dist.

c. Next, make a scatter plot showing the current `total_deaths_per_million` (y-axis) vs. the current `total_cases_per_million`. Compute a linear regression line of the total number of deaths per million as a function of the total number of cases per million and add the fitted regression line to the plot. What is the slope and what does it represent? 
plot(current$total_cases_per_million, current$total_deaths_per_million, xlab = "Total cases per million", ylab = "Total deaths per million")
reg_line = lm(current$total_deaths_per_million ~ current$total_cases_per_million)
abline(reg_line, col = "red")

summary(reg_line)
## 
## Call:
## lm(formula = current$total_deaths_per_million ~ current$total_cases_per_million)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2821.6  -585.0  -348.5   519.1  5350.3 
## 
## Coefficients:
##                                    Estimate  Std. Error t value
## (Intercept)                     606.5936124  95.5122157   6.351
## current$total_cases_per_million   0.0039429   0.0004682   8.421
##                                            Pr(>|t|)    
## (Intercept)                     0.00000000116178243 ***
## current$total_cases_per_million 0.00000000000000433 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1055 on 226 degrees of freedom
## Multiple R-squared:  0.2388, Adjusted R-squared:  0.2355 
## F-statistic: 70.92 on 1 and 226 DF,  p-value: 0.000000000000004325

‘’As we can see visually, there is a weak positive correlation between the number of cases and number of deaths. We can take the R^2 value we are exposed to in the summary function and by the root of it the get the corr between the two variables.’’

  1. Vaccinations in different continents:
    1. For each of the six different continents (Africa, Asia, Europe, North America, Oceania, South America), make a boxplot of the distribution of the (current) total number of vaccinations per hundred in all the countries in the continent. Show one figure with the six boxplots next to each other. (Use the new current data-frame created in qu. 3). Find two outlier countries (can be of any continent) and write their name and value.
# Using box plot func
box_continent =  current [, c("continent","total_vaccinations_per_hundred")] %>% filter(continent != "")
a = boxplot(total_vaccinations_per_hundred ~ continent ,
data= box_continent,
main="Disterbution by continent",
xlab="continet",
ylab="total vac per 100",
col="deepskyblue4",
border="brown"

)

#bring 2 outlier counries
current %>% top_n(2, total_vaccinations_per_hundred) 
##    location     continent total_deaths_per_million
## 1      Cuba North America                  753.258
## 2 Gibraltar        Europe                 3027.515
##   total_vaccinations_per_hundred people_fully_vaccinated_per_hundred
## 1                         315.20                               87.84
## 2                         355.75                              122.94
##   total_boosters_per_hundred excess_mortality_cumulative_per_million
## 1                      57.66                                43.75525
## 2                     107.92                               581.75774
##   total_cases_per_million
## 1                97368.52
## 2               525540.95
b. Define (for the original data-frame from qu. 1) a new column called `booster_ratio`, that lists for each date the fraction of individuals that got a third, booster shot in a country (`total_boosters`), out of all individuals that got two shots (`people_fully_vaccinated`), by dividing the two columns (if either of them is `NA` or if the denominator is zero, set `booster_ratio` to `NA`).

Plot the booster_ratio as a function of time for the six continents (on the same plot, using different colors) and describe the results.

#creating new col and also avoiding dividing in zero
covid_19$booster_ratio = ifelse(covid_19$people_fully_vaccinated > 0, covid_19$total_boosters / covid_19$people_fully_vaccinated, NA)
box_continent_2 =  covid_19 [, c("continent","booster_ratio", "date")] %>% filter(continent != "")
#Using smooth ggplot to make the graph more clear to veiw.
ggplot(box_continent_2, aes(x = date, y= booster_ratio)) + 
  geom_smooth(aes( color= (continent))) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 154690 rows containing non-finite values (stat_smooth).

#As we can see most of the it takes time until the ratio upgrade, and it happend in all the continents, Africa stay behind buy has also some raising.
  1. Displaying monthly data:
    1. For each month from January 2020 until March 2022 compute the total number of new cases per million in this month for each country individually (around 230 numbers for each month), and make a boxplot representing the empirical distribution fo these ~230 values. Show the boxplots for all months in one plot together in a chronological order, from left to right.
month_ds =  subset(covid_19, date> "2019-12-31" & date < "2022-04-01" )
month_ds = month_ds[,c("location", "date", "new_cases_per_million" )]
month_ds = month_ds %>% group_by(date= format(as.Date(date),"%y-%m")) %>% group_by(date,location) %>% summarise(new_cases_per_million_month = sum(new_cases_per_million))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
boxplot( new_cases_per_million_month ~ date ,
data= month_ds,
main=" Empirical distribution fo these",
xlab="Month",
ylab="new cases",
col="green",
border="orange"
)

    *Guidance:* (i) Beware to not double-count cases/deaths/vaccinations. (ii) Treat each month separately (e.g. March 2020 and March 2022 are different).

b. Repeat (a.), but this time with the total number of `new_deaths` and `new_vaccinations` for each month (two separate plots). 
month_ds_death =  subset(covid_19, date> "2019-12-31" & date < "2022-04-01" )
month_ds_death = month_ds_death[,c("location", "date", "new_deaths" )]
month_ds_death = month_ds_death %>% group_by(date= format(as.Date(date),"%y-%m")) %>% group_by(date,location) %>% summarise(new_deaths_month = sum(new_deaths))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
boxplot(new_deaths_month ~ date ,
data= month_ds_death,
main=" new deaths month dist",
xlab="Month",
ylab="new deaths",
col="blue",
border="black")

month_ds_vacc =  subset(covid_19, date> "2019-12-31" & date < "2022-04-01" )
month_ds_vacc = month_ds_vacc[,c("location", "date", "new_vaccinations" )]
month_ds_vacc = month_ds_vacc %>% group_by(date= format(as.Date(date),"%y-%m")) %>% group_by(date,location) %>% summarise(new_vacc_month = sum(new_vaccinations))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
boxplot(new_vacc_month ~ date ,
data= month_ds_vacc,
main=" new vacc per month dist",
xlab="Month",
ylab="new vacc in a month",
col="blue",
border="black")

    What can you conclude about the pandemic at different periods from these plots? describe the results for each of the three plots in 2-3 sentences. 

WE can see that between MAy to January it was increase in deatts rate in corelation in with vaccination rate.

  1. R - disease reproduction number: Add to the covid data-frame a new column called R_cases, defined for location and date as the number of new_cases_smoothed at this date, divided by the value of the same variable seven days before (if the value seven days before is zero or not defined, define R_cases at the current date to be NA). This column represents roughly the expected number of people that each case infects, and determines the spread of the disease, with values above (below) one indicating that the disease is spreading (declining).
r_case_df = covid_19[,c("location", "date", "new_cases_smoothed_per_million")]
r_case_df = r_case_df %>% group_by(location) %>% summarise(R_cases = new_cases_smoothed_per_million / lag(new_cases_smoothed_per_million, n=7), date)
## `summarise()` has grouped output by 'location'. You can override using the
## `.groups` argument.
r_case_df$R_cases[is.infinite(r_case_df$R_cases)] = NA
covid_19$R_cases = r_case_df$R_cases
Plot the `R_cases` value as a function of time for `Israel`, `United Kingdom` and `United States`, and describe the results. <br>

List in a table the number of days at which the disease was spreading (value above 1) in each of the three countries.

country_data =  covid_19 [, c("location", "date", "R_cases")] %>%  subset( location == "Israel" |location == "United States"| location == "United Kingdom")
ggplot(country_data, aes(x = date, y= R_cases)) + 
  geom_line(aes( color= (location))) 
## Warning: Removed 38 row(s) containing missing values (geom_path).

country_data <- fill(country_data,everything())

country_data_2 = country_data %>% group_by(location) %>% filter(R_cases >=1) 

r_isr = length(which(country_data_2$location == "Israel"))
r_US = length(which(country_data_2$location == "United States"))
r_UK = length(which(country_data_2$location == "United Kingdom"))

table_r = c(Israel = r_isr, UK = r_UK, US = r_US)

table_r
## Israel     UK     US 
##    403    446    420
  1. Displaying data on the world map: Use the rworldmap package to display the world map and color each country based on the total_deaths_per_million. Repeat for total_vaccinations_per_hundred, and excess_mortality_cumulative_per_million. Describe the resulting maps in a couple of sentences.

    List the top three countries for each of these variables in a table.

    Guidance: Use the joinCountryData2Map and mapCountryData commands to make the plots. Keep countries with missing data in white.

world_current =  joinCountryData2Map(current, joinCode = "NAME", nameJoinColumn = "location")
## 206 codes from your data successfully matched countries in the map
## 22 codes from your data failed to match with a country code in the map
## 37 codes from the map weren't represented in your data
mapCountryData(world_current, nameColumnToPlot = "total_deaths_per_million")

current %>% top_n(3, total_deaths_per_million) %>% select(location, total_deaths_per_million )
##                 location total_deaths_per_million
## 1 Bosnia and Herzegovina                 4829.845
## 2               Bulgaria                 5344.040
## 3                   Peru                 6377.840
mapCountryData(world_current, nameColumnToPlot = "total_vaccinations_per_hundred")

current %>% top_n(3, total_vaccinations_per_hundred) %>% select(location, total_vaccinations_per_hundred )
##    location total_vaccinations_per_hundred
## 1     Chile                         271.92
## 2      Cuba                         315.20
## 3 Gibraltar                         355.75
mapCountryData(world_current, nameColumnToPlot = "excess_mortality_cumulative_per_million")

current %>% top_n(3, total_vaccinations_per_hundred) %>% select(location, excess_mortality_cumulative_per_million )
##    location excess_mortality_cumulative_per_million
## 1     Chile                              2563.12576
## 2      Cuba                                43.75525
## 3 Gibraltar                               581.75774
  1. Cross correlations and delay from diagnosis to death: We want to use the data and cross-correlation in order to study the typical time delay between diagnosis of Covid-19 and death from Covid-19 for cases not surviving the disease. For two functions of time \(X(t)\) and \(Y(t)\) (here \(t\) is discrete, representing for example days) we define their cross-correlation at time-delay \(\Delta_t\) as follows: \(cross_{corr}(\Delta_t ; X, Y) = Corr(X(t), Y(t+\Delta_t))\).
    That is, the cross-correlation function at the time-delay \(\Delta_t\) for two vectors of length \(n\) is obtained by computing the Pearson correlation coefficient of the vector \(X[1,...,n-\Delta_t]\) with the vector \(Y[\Delta_t+1,...,n]\), for \(\Delta_t>0\). For \(\Delta_t < 0\) we replace the role of \(X\) and \(Y\) in this formula.

    1. Write a function that receives as input the data-frame, a country name and the name of two columns, and computes the value of their cross-correlation in this country for time delay of up to two months apart, that is for all values of \(\Delta_t\) between \(-60\) days and \(60\) days. The function should return a vector of length \(121\) representing these correlations. Do not use available cross-correlation functions in \(R\), but write the function yourself.
cross_corr <- function(corr_df, country, col1, col2){
  new_vec <- c() #creating new vector for the cross correlations we will calculate
  filter_var <- corr_df%>%filter(location == country)
  
  col1 <- as.numeric(as.list(t(filter_var[,col1])))
  col2 <- as.numeric(as.list(t(filter_var[,col2])))
  # Defining the columns as Transpose in order to calculate the corr, setting the country for each one.
  for (i in c(-60:60)) {
    
    if (i< 0){new_vec <- append(new_vec, cor(col1[(abs(i) + 1):length(col1)], col2[1:(length(col1)-abs(i))], use = "pairwise.complete.obs"))}
    
    else if (i>=0){new_vec <- append(new_vec, cor(col2[(i+1):length(col1)], col1[1:(length(col1)-i)], use = "pairwise.complete.obs"))}
    #corr(x[1,...1-delta(t)],y[1+delta(t),..n]), and omitting NA's.
    
    }
       
  return(new_vec) 
  
}

World_cross_corr <- cross_corr(covid_19,"World","new_cases_smoothed","new_deaths_smoothed")
b. Use the function from (a.) to compute the cross correlation between the number of `new_cases_smoothed` and `new_deaths_smoothed` for the entire *World*, and plot it as a function of $\Delta_t$. 

At what time delay is the cross correlation maximized? what is your interpretation of this time-delay?

World_cross_corr <- cross_corr(covid_19,"World","new_cases_smoothed","new_deaths_smoothed")


interval <- (c(-60:60))

plot(x=interval,y=World_cross_corr ,xlab = "Delta",ylab = "Cross corr",main = "Cross-Correlation for the world vs.Delta")

  1. Death risk after Covid-19 infection:
    1. Add to the main covid data-frame a new column called death_rate, defined for each location and date as the number of total_deaths divided by the number of total_cases. This column represents the risk of a person diagnosed with covid to die from the disease. Plot for each of the six continents and the entire world the death_rate as a function of time (one plot with separate colors/symbols). Since there is a delay from time of diagnosis to time of death, the initial values of this column are less reliable as a measure of death risk, hence start your plot on January 1st, 2021. Do we see a decrease in the risk over time? can you suggest explanations for the observed trends?
covid_19$death_rate = ifelse(covid_19$total_cases >0, covid_19$total_deaths / covid_19$total_cases, NA)

box_continent_2 =  covid_19 [, c("continent","death_rate", "date")] %>% filter(continent != "")
box_continent_2 =  subset(covid_19, date> "2020-12-31" & date < "2022-04-01" )

ggplot(box_continent_2, aes(x = date, y= death_rate)) + 
  geom_smooth(aes( color= (continent))) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11503 rows containing non-finite values (stat_smooth).

There is a huge decrease between those dates. maybe we can asume its a wave of corona that just passed or there is an influence by another parameter like vaccination.

b. Make a similar plot for all continents and the world, but for the `total_vaccinations_per_hundred` variable. Do the plots suggest that the change in risk is correlated to the change in the number of vaccinations?
box_continent_3 =  subset(covid_19, date> "2020-12-31" & date < "2022-04-01" )

ggplot(box_continent_3, aes(x = date, y= total_vaccinations_per_hundred)) + 
  geom_smooth(aes( color= (continent))) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 60832 rows containing non-finite values (stat_smooth).

as we can see there is negative correlation between vaccinations and deaths. as much as more people get the vaccinations, less people dying.

  1. Excess mortality: It is claimed that in some countries the death-rate from Covid-19 is under-reported. We can investigate this using excess mortality data - this data lists the number of deaths from all causes at each country, compared to the usual number of deaths at the same country and date calculated from previous years mortality data. The gap between the two, representing and increase (or decrease) in deaths beyond the usual rate, is called ‘excess mortality’, and might be partially or fully explained by deaths from Covid-19.
    1. Use the current data-frame to make a scatter plot of the current date total_deaths_per_million vs. the excess_mortality_cumulative_per_million for all countries for which excess mortality data is available. Add the lines \(y=x, y=x+2000\) and \(y=x-2000\) to the plot. Mark on the graph in a different color all the countries for which the difference between the the excess mortality and the covid death rate (per million) is at least 2000 and add their names to the plot (you can use the text function).
ggplot(current, aes(x=total_deaths_per_million, y=excess_mortality_cumulative_per_million)) + geom_point(aes(color = (abs(excess_mortality_cumulative_per_million-total_deaths_per_million)>=2000)), show.legend = F) + geom_abline(intercept = 0, slope = 1) + geom_abline(intercept = 2000, slope = 1, col = "gray") + geom_abline(intercept = -2000, slope = 1, col = "brown") + geom_text(aes(label=ifelse(abs(excess_mortality_cumulative_per_million-total_deaths_per_million)>=2000, location, " ")), size = 2.5)
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_text).

b. Pick three countries where the excess mortality was at least 2000 per million *above* the covid death rate, and that have at least $50$ available data points for each for `excess_mortality_cumulative_per_million`. 

Use the main covid data-frame to plot as a function of time both the total_deaths_per_million and the excess_mortality_cumulative_per_million for each country (one plot - use different colors/symbols). Identify from the plot for each country the time periods where most deaths not explained by Covid-19 occurred.

ex_mor <- subset(covid_19, location == c("Lithuania","", "Romania", "Bulgaria") , select = c("location", "date", "death_rate", "total_deaths_per_million","excess_mortality_cumulative_per_million"))     
## Warning in location == c("Lithuania", "", "Romania", "Bulgaria"): longer object
## length is not a multiple of shorter object length
ex_mor_plot <- ggplot(data = ex_mor) + geom_smooth(aes(x = date, y = excess_mortality_cumulative_per_million, colour = location)) + geom_smooth(aes(x = date, y = total_deaths_per_million, colour = location)) 

ex_mor_plot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 508 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).